Overview

The project will focus on TCGA RNA seq data analysis and machine learning techniques on Liver Hepatocellular Carcinoma patient data to find genes that are ultra-sensitive to cancerous genes, i.e. genes that drastically vary in concentration with respect to small changes in the concentration of the cancerous genes. Priority will be given to the genes that play a major role in endothelial membrane receptors since they will most likely affect nanoparticles’ pharmacodynamic and pharmocokinetic properties in the blood vessel.

Introduction

With the accelerating development in the application of Nanomaterial in medicine, next generation pharmacodynamics and pharmacokinetic modeling and simulation can accelerate the design, validation and translation of targeted nanoparticles by achieving selective targeting of affinity-ligand coated nanoparticles. Within the last 10 years, the number of FDA approved nanomaterials have increased from 2 to almost 50 different types. The types of nanoparticles can range from inorganic to organic, which results in numerous ways to perform drug discovery, delivery and development. Case studies of companies and technology in this market space reveals a three-part structure for commercialization of nanoscale drug delivery platforms: A technology to focus on (DNA cages, liposomes, etc.), a specific pipeline where they combine their developed drug and the nanomaterial to produce a final product and finally, a disease to target. One of the main challenges revolves around the design and validation of the nanoparticle and the drug for optimal performance and therapeutic efficacy. This can be facilitated by developing a qualitative and exhaustive computational platform for modeling the targeting of functionalized nanocarriers to optimize experimental design protocols for drug delivery.

The goal of this research is to extend the previously developed physiologically-based pharmacokinetic modeling of nanoparticle based drug delivery system to better model the experimental observations (Part 1) and to integrate TCGA genetic analysis and the pharmacodynamic simulation to reflect targeted membrane protein expression in human cancers (Part II). This R project will focus on Part II of the research while Part I will be coded in python.

Library setup:

library("DESeq2")
library(rbokeh)
library(limma)
library("pheatmap")
library("RColorBrewer")
library(ggfortify)
library(GenomicDataCommons)
library(ggplot2)
library(knitr)
library(kableExtra)
library(jsonlite)
library(listviewer)
library(survival)
library(rnaseqGene)
library(devtools)

TCGA Dataset:

In TCGA database, there are 4 metadata projects(), cases(), files() and annotations().

Projects Analysis:

The JSON table gives very good insight into how the complicated database is structured from the nodes of the metadata.

paste("Total number of projects available in TCGA = ", GenomicDataCommons::count(projects())) 
## [1] "Total number of projects available in TCGA =  40"
paste ('Number of default fields = ', length(default_fields('projects')))
## [1] "Number of default fields =  10"
paste ('Number of available fields = ',length(available_fields('projects')))
## [1] "Number of available fields =  22"
# Table of dafault field in projects:
x <- data.frame(default_fields('projects'))
x
##    default_fields..projects..
## 1      dbgap_accession_number
## 2                disease_type
## 3       intended_release_date
## 4                        name
## 5                primary_site
## 6        project_autocomplete
## 7                  project_id
## 8                  releasable
## 9                    released
## 10                      state
kable(x, "html") %>%
  kable_styling(bootstrap_options = c("striped","hover", "condensed"), full_width = F, position = "left")
default_fields..projects..
dbgap_accession_number
disease_type
intended_release_date
name
primary_site
project_autocomplete
project_id
releasable
released
state
# Table of all available field in projects:
kable(data.frame(available_fields('projects')), "html") %>%
  kable_styling(bootstrap_options = c("striped","hover", "condensed"), full_width = F, position = "left")%>%
  scroll_box(width = "500px", height = "300px")
available_fields..projects..
dbgap_accession_number
disease_type
intended_release_date
name
primary_site
program.dbgap_accession_number
program.name
program.program_id
project_autocomplete
project_id
releasable
released
state
summary.case_count
summary.data_categories.case_count
summary.data_categories.data_category
summary.data_categories.file_count
summary.experimental_strategies.case_count
summary.experimental_strategies.experimental_strategy
summary.experimental_strategies.file_count
summary.file_count
summary.file_size
# JSON Table of default project fields: 
all_field_project <- projects()%>%results_all()
jsonedit(all_field_project,mode = "view")
# JSON Table of Disease type of all the projects and sub projects: 
jsonedit(all_field_project$disease_type,mode = "view")    

cases() Analysis:

# Projects and No. of cases:
res = cases() %>% facet("project.project_id") %>% aggregations() %>% data.frame()
colnames(res) <- c("Project", "NoOfCases")
kable(res, "html") %>%
  kable_styling(bootstrap_options = c("striped","hover", "condensed"), full_width = F, position = "center")%>%
  scroll_box(width = "200px", height = "500px")
Project NoOfCases
FM-AD 18004
TARGET-NBL 1127
TCGA-BRCA 1098
TARGET-AML 988
TARGET-WT 652
TCGA-GBM 617
TCGA-OV 608
TCGA-LUAD 585
TCGA-UCEC 560
TCGA-KIRC 537
TCGA-HNSC 528
TCGA-LGG 516
TCGA-THCA 507
TCGA-LUSC 504
TCGA-PRAD 500
TCGA-SKCM 470
TCGA-COAD 461
TCGA-STAD 443
TCGA-BLCA 412
TARGET-OS 381
TCGA-LIHC 377
TCGA-CESC 307
TCGA-KIRP 291
TCGA-SARC 261
TCGA-LAML 200
TCGA-ESCA 185
TCGA-PAAD 185
TCGA-PCPG 179
TCGA-READ 172
TCGA-TGCT 150
TCGA-THYM 124
TCGA-KICH 113
TCGA-ACC 92
TCGA-MESO 87
TCGA-UVM 80
TARGET-RT 75
TCGA-DLBC 58
TCGA-UCS 57
TCGA-CHOL 51
TARGET-CCSK 13
# Bar plot of Projects and No. of cases: 
figure(width = 900, ylab = "No. Of Cases", xlab = "Projects", title = "Project~Cases(All)") %>% ly_bar(Project, NoOfCases, data = res,hover = res$NoOfCases ) %>% theme_axis("x", major_label_orientation = 90)
# Bar plot of Projects and No. of cases without the FM-AD Project: 
res <- res[-1,]
figure(width = 900, ylab = "No. Of Cases", xlab = "Projects", title = "Project~Cases(w/o FM-AD)") %>% ly_bar(res$Project, res$NoOfCases, hover = res$NoOfCases) %>% theme_axis("x", major_label_orientation = 90)
# No. of cases in the HCC Liver cancer project: 
cases()  %>% GenomicDataCommons::filter( ~ project.project_id == 'TCGA-LIHC'  ) %>% GenomicDataCommons::count()
## [1] 377

files() Analysis:

#Different file type:
q = files() %>%
    GenomicDataCommons::select(available_fields('files')) %>%
    GenomicDataCommons::filter(~ cases.project.project_id=='TCGA-LIHC' &
             data_type=='Gene Expression Quantification')
q %>% facet('analysis.workflow_type') %>% aggregations() %>% kable()%>%
  kable_styling(bootstrap_options = c("striped","hover", "condensed"), full_width = F, position = "center")
key doc_count
HTSeq - Counts 424
HTSeq - FPKM 424
HTSeq - FPKM-UQ 424
# Total number of HiSeq files in Liver Hepatocellular Carcinoma
files()  %>% GenomicDataCommons::filter( ~ cases.project.project_id == 'TCGA-LIHC' & analysis.workflow_type == 'HTSeq - FPKM-UQ') %>% GenomicDataCommons::count()
## [1] 424
# Data type of files available in TCGA dataset: 
files() %>% facet('data_type') %>% aggregations()%>%kable()%>%
  kable_styling(bootstrap_options = c("striped","hover", "condensed"), full_width = F, position = "center")%>%
  scroll_box(width = "400px", height = "500px")
key doc_count
Annotated Somatic Mutation 63580
Raw Simple Somatic Mutation 63580
Aligned Reads 45985
Gene Expression Quantification 34713
Slide Image 30036
Biospecimen Supplement 25151
Copy Number Segment 22376
Masked Copy Number Segment 22376
Clinical Supplement 12496
Methylation Beta Value 12359
Isoform Expression Quantification 11488
miRNA Expression Quantification 11488
Raw CGI Variant 435
Aggregated Somatic Mutation 186
Masked Somatic Mutation 132
# Analysis Workflow_type available in TCGA dataset:
files() %>% facet('analysis.workflow_type') %>% aggregations()%>%kable()%>%
  kable_styling(bootstrap_options = c("striped","hover", "condensed"), full_width = F, position = "center")%>%
  scroll_box(width = "400px", height = "500px")
key doc_count
DNAcopy 44752
BCGSC miRNA Profiling 22976
BWA with Mark Duplicates and Cocleaning 22893
FM Simple Somatic Mutation 18004
FoundationOne Annotation 18004
Liftover 12359
STAR 2-Pass 11604
HTSeq - Counts 11571
HTSeq - FPKM 11571
HTSeq - FPKM-UQ 11571
BWA-aln 11488
SomaticSniper 11398
SomaticSniper Annotation 11398
MuTect2 11396
MuTect2 Annotation 11396
VarScan2 11395
VarScan2 Annotation 11395
MuSE 11387
MuSE Annotation 11387
VCF LiftOver 435
MuSE Variant Aggregation and Masking 69
MuTect2 Variant Aggregation and Masking 69
SomaticSniper Variant Aggregation and Masking 69
VarScan2 Variant Aggregation and Masking 69
FoundationOne Variant Aggregation and Masking 42
_missing 67683

Initial Data Wragling for Data Analysis:

Raw RNAseq dataset analysis:

The LIHC (Liver Hepatocellular Carcinoma) project’s RNAseq and clinical folders are downloaded and named RNA and Clinical respectively. A sample of RNAseq dataset with first 5 rows and 5 columns.

mRNAseq <- read.table('RNA/LIHC.rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_genes_normalized__data.data.txt',nrows=20533, header=T,row.names=1,sep='\t')
mRNAseq[1:5,1:5]
##             TCGA.2V.A95S.01A.11R.A37K.07 TCGA.2Y.A9GS.01A.12R.A38B.07
## gene_id                 normalized_count             normalized_count
## ?|100130426                       0.0000                       0.0000
## ?|100133144                       1.5051                      26.4120
## ?|100134869                       3.7074                       2.6663
## ?|10357                          90.1124                      71.0054
##             TCGA.2Y.A9GT.01A.11R.A38B.07 TCGA.2Y.A9GU.01A.11R.A38B.07
## gene_id                 normalized_count             normalized_count
## ?|100130426                       0.0000                       0.0000
## ?|100133144                       0.0000                       5.7222
## ?|100134869                       4.4833                       5.1216
## ?|10357                          95.5122                      61.6679
##             TCGA.2Y.A9GV.01A.11R.A38B.07
## gene_id                 normalized_count
## ?|100130426                       0.0000
## ?|100133144                      11.4975
## ?|100134869                       5.4230
## ?|10357                         104.4670

Removing the first row from the gene dataset (unwanted dataset):

mRNAseq <-mRNAseq[-1,]
mRNAseq[1:5,1:5]
##             TCGA.2V.A95S.01A.11R.A37K.07 TCGA.2Y.A9GS.01A.12R.A38B.07
## ?|100130426                       0.0000                       0.0000
## ?|100133144                       1.5051                      26.4120
## ?|100134869                       3.7074                       2.6663
## ?|10357                          90.1124                      71.0054
## ?|10431                        1017.1038                     639.2311
##             TCGA.2Y.A9GT.01A.11R.A38B.07 TCGA.2Y.A9GU.01A.11R.A38B.07
## ?|100130426                       0.0000                       0.0000
## ?|100133144                       0.0000                       5.7222
## ?|100134869                       4.4833                       5.1216
## ?|10357                          95.5122                      61.6679
## ?|10431                         742.4344                    1186.9807
##             TCGA.2Y.A9GV.01A.11R.A38B.07
## ?|100130426                       0.0000
## ?|100133144                      11.4975
## ?|100134869                       5.4230
## ?|10357                         104.4670
## ?|10431                         878.1726

Filtering the RNAseq data for row with low counts:

Totalgenes = paste0("Total genes = ", dim(mRNAseq)[1])

# Turn the mRNAseq data into matrix format
x <- as.matrix(mRNAseq)

# Turn each row in numeric format 
x <- t(apply(x,1, as.numeric))

# Count how many 0s are in each row in the gene matrix: 
r <- as.numeric(apply(x,1,function(i) sum(i == 0)))

# list is created with rows which have more than half of row data with no counts
remove <- which(r > dim(x)[2]*0.5)

# mRNAseq datamatrix is reduced by subtracting the remove list
mRNAseq <- mRNAseq[-remove,]

# Visualizing low count gene removal
text1 = paste0("Removed genes = ", length(remove))
text2 = paste0("Preserved genes = ", dim(mRNAseq)[1])
Totalcases = paste0("Total cases = ", dim(mRNAseq)[2])

plot1 = figure(width = c(700), xlab = "Genes with no. of 0 counts") %>%ly_hist(r,breaks = 50) %>%
  ly_abline(v = dim(x)[2]*0.5) %>%
  ly_text(x=230, y=14000, text=text1, color = c("red"))%>%
  ly_text(x=30, y=14000, text=text2, color = c("red")) %>%
  ly_rect(250, 6000, 400, 8000, fill_alpha=0, line_color = c("black"))%>%
  ly_text(x=255, y=7000, text = Totalgenes)%>%
  ly_text(x=255, y=6000, text = Totalcases)

plot1

Here we can see that the genes that had counts in more than 50% of the patient population get preserved (16910 genes) and the other 423 genes are removed from the dataset since they have 0 counts in more than 50% of the patient population.

Transformation of dataset for genetic analysis:

Here two different data transformation technique are performed to examin and select which to proceed with for further analysis. The two data transformation techniques are:

Visualization of raw reduced dataset:

rna <- mRNAseq
rna <- as.matrix(rna)
rna <- apply(rna,1,as.numeric)
rna <-t(rna)
# Density plot of Mean and SD of column of 0 filtered matrix (Between samples): 
figure(ylab = "Stan. Dev. of Columns(mRNAseq)", xlab = "Means of Columns(mRNAseq)", title = "Standard Deviation vs Mean of Columns of mRNAseq") %>% ly_hexbin(colMeans(rna), colSds(rna)) 
# Density plot of Mean and SD of row of 0 filtered matrix (Between Genes): 
figure(ylab = "Stan. Dev. of Rows(mRNAseq)", xlab = "Means of Rows(mRNAseq)", title = "Standard Deviation vs Mean of Rows of mRNAseq") %>% ly_hexbin(rowMeans(rna), rowSds(rna)) 
# Plotting gene1 vs gene2 across samples: 
plot(rna[,1],rna[,2],pch=16, cex=0.3)

# QQ Plotting gene1 vs gene2 across samples: 
qqplot(rna[,1],rna[,2],pch=16, cex=0.3)

Looking between samples, the range of SD and Mean vary over 6 and 5 magnitude of order respectively and the same goes for when looking between genes as it varies over the order of 6. Also the gene1 vs gene2 plot and the QQ plot of itself gives no useful insight as the datatpoints are very skewed towards low count genes.

Performing (Log2 + 1) transformation and getting the scatter and QQ plots for before and after transformation:

# puseudocount and log base 2 transformation of gene counts: 
log.mRNA.one <- log2(rna + 1)

# Density plot of Mean and SD of column of transformed matrix  
figure(ylab = "Stan. Dev. of Columns(Log.one.(mRNAseq))", xlab = "Means of Columns(Log.one.(mRNAseq))", title = "Standard Deviation vs Mean of Columns of transformed mRNAseq (Between samples)") %>% ly_hexbin(colMeans(log.mRNA.one), colSds(log.mRNA.one)) 
# Density plot of Mean and SD of rows of transformed matrix  
figure(ylab = "Stan. Dev. of Rows(Log.one.(mRNAseq)", xlab = "Means of Rows(Log.one.(mRNAseq))", title = "Standard Deviation vs Mean of Rows of mRNAseq (Between Genes)") %>% ly_hexbin(rowMeans(log.mRNA.one), rowSds(log.mRNA.one)) 
plot(log.mRNA.one[,1],log.mRNA.one[,2],pch=16, cex=0.3)

qqplot(log.mRNA.one[,1],log.mRNA.one[,2],pch=16, cex=0.3)

Performing Voom tranformation and plotting the scatter plot and QQ plot after transformation:

When there is an uneven number of samples, such as uneven number of normal to cancer patients, voom function is perferred.

# Number of Normal cell samples: 
n_index <- which(substr(colnames(mRNAseq),14,14) == '1')
length(n_index)
## [1] 50
# Number of tumor cell samples: 
t_index <- which(substr(colnames(mRNAseq),14,14) == '0')
length(t_index)
## [1] 373
# Applying voom function from limma package to normalize the data
vm <- function(x){
  cond <- factor(ifelse(seq(1,dim(x)[2],1) %in% t_index, 1,  0))
  d <- model.matrix(~1+cond)
  x <- t(apply(x,1,as.numeric))
  ex <- voom(x,d,plot=F)
  return(ex$E)
}

rna_vm  <- vm(mRNAseq)
colnames(rna_vm) <- colnames(mRNAseq)
dim(rna_vm)
## [1] 16910   423
# Density plot of Mean and SD of column of 0 filtered matrix: 
figure(ylab = "Stan. Dev. of Columns(mRNAseq)", xlab = "Means of Columns(mRNAseq)", title = "Standard Deviation vs Mean of Columns of mRNAseq (Between samples)") %>% ly_hexbin(colMeans(rna_vm), colSds(rna_vm)) 
# Density plot of Mean and SD of row of 0 filtered matrix: 
figure(ylab = "Stan. Dev. of Rows(mRNAseq)", xlab = "Means of Rows(mRNAseq)", title = "Standard Deviation vs Mean of Rows of mRNAseq (Between Genes)") %>% ly_hexbin(rowMeans(rna_vm), rowSds(rna_vm)) 
figure(ylab = "Stan. Dev. of Columns(mRNAseq)", xlab = "Means of Columns(mRNAseq)", title = "Standard Deviation vs Mean of Columns of mRNAseq (Between samples)") %>% ly_hexbin(colMeans(rna_vm), colSds(rna_vm)) %>% ly_hexbin(rowMeans(rna_vm), rowSds(rna_vm)) 
plot(rna_vm[,1],rna_vm[,2],pch=16, cex=0.3)

qqplot(rna_vm[,1],rna_vm[,2],pch=16, cex=0.3)

In log2 +1 transformation and voom function, the range of SD and Mean between samples and genes are extremely small compared to the previous raw dataset analysis. This transformation lets us compare the genes and samples. Since the variation is distributed more evenly in voom normalization, this analysis will adopt the voom normalized data to perform further analysis.

Reordering the rna_vm matrix to set normal and tumor cases seperatly together:

# Reordering the matrix columns to have the normal cases first and then the tumor cases second: (this is for the heatmap plot)
rna_vm <- rna_vm[ , c(n_index, t_index)]

# just checking again, when printing both n_index and t_index, they should have a continues numbering. 
# Number of Normal cell samples: 
n_index <- which(substr(colnames(rna_vm),14,14) == '1')

# Number of tumor cell samples: 
t_index <- which(substr(colnames(rna_vm),14,14) == '0')

creating the Distance matrix between columns for heatmap:

sampleDists <- dist(t(rna_vm))
sampleDistMatrix <- as.matrix(sampleDists)

Heatmap

colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255)

annotdf <- data.frame(row.names = colnames(rna_vm), category = c(rep("Normal", length(n_index)), rep("Tumor", length(t_index))))
rownames(sampleDistMatrix) <- colnames(rna_vm)
colnames(sampleDistMatrix) <- c(rep("Normal", length(n_index)), rep("Tumor", length(t_index)))

pheatmap(sampleDistMatrix,
         col = colors,
         gaps_row=length(n_index),
         gaps_col=length(n_index),
         annotation_row = annotdf,
         cluster_rows = FALSE,
         cluster_cols = FALSE,
         show_rownames = FALSE,
         show_colnames = FALSE,
         annotation_names_row = FALSE)

In the above heatmap, we can see that the normal patients without cancer are more alike in terms of gene expression than within cancer patients.

autoplot(prcomp(t(rna_vm)), data = t(rna_vm), colour = c(rep("black", length(n_index)), rep("red", length(t_index))))

The above PCA plot of PC1 and PC2 datasets give a biologically relavant insight. The red dots denots cancer patients and the black dots denots normal patients. The cancer points are clusted around the origin, while the normal points are clusted on the fourth quadrant outside the cancer cluster. This tells us that normal gene expression produces almost the same variance while the variance of cancer gene expression are very random and the reason behind it could be that in Liven carcinoma cancer, there might be different sub categories of the cancer.

Heatmap

colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255)

annotdf <- data.frame(row.names = colnames(rna_vm), category = c(rep("Normal", length(n_index)), rep("Tumor", length(t_index))))
rownames(sampleDistMatrix) <- colnames(rna_vm)
colnames(sampleDistMatrix) <- c(rep("Normal", length(n_index)), rep("Tumor", length(t_index)))

pheatmap(sampleDistMatrix,
         col = colors,
         gaps_row=length(n_index),
         gaps_col=length(n_index),
         annotation_row = annotdf,
         cluster_rows = TRUE,
         cluster_cols = FALSE,
         show_rownames = FALSE,
         show_colnames = FALSE,
         annotation_names_row = FALSE)

Z-score matrix of mRNAseq normalized counts:

Here the Z- score for each gene in each sample is calculated by comparing the normal and cancer samples from the cases. This allows us to consider individual gene significant values instead of cumulating pvalues for all genes combined to find the significant genes. By calculating the z score, we can find if a particular gene is significantly up or down regulated for each patient.

# Calculating Z Score
scal <- function(x,y){
  mean_n <- rowMeans(y)  # mean of normal
  sd_n <- apply(y,1,sd)  # SD of normal
  res <- matrix(nrow=nrow(x), ncol=ncol(x))
  colnames(res) <- colnames(x)
  rownames(res) <- rownames(x)
  for(i in 1:dim(x)[1]){
    for(j in 1:dim(x)[2]){
      res[i,j] <- (x[i,j]-mean_n[i])/sd_n[i]
    }
  }
  return(res)
}

z_rna <- scal(rna_vm[,t_index],rna_vm[,n_index])

# Setting the rownames keeping only gene name
rownames(z_rna) <- sapply(rownames(z_rna), function(x) unlist(strsplit(x,'\\|'))[[1]])

# Dimensions of Z score for each genes in each tumor samples in terms of normal sample values 
dim(z_rna)
## [1] 16910   373
z_rna[1:5,1:5]
##   TCGA.2V.A95S.01A.11R.A37K.07 TCGA.2Y.A9GS.01A.12R.A38B.07
## ?                    0.4781493                    3.2921169
## ?                    1.4040622                    1.2730221
## ?                    0.9726639                    0.7364384
## ?                    1.2534915                   -0.0275773
## ?                    2.7221215                    2.7672378
##   TCGA.2Y.A9GT.01A.11R.A38B.07 TCGA.2Y.A9GU.01A.11R.A38B.07
## ?                   -1.1816147                    1.4699895
## ?                    1.3177359                    1.5580328
## ?                    0.3042798                   -0.9454642
## ?                   -1.1875702                    1.2429450
## ?                    1.7615871                    3.5120054
##   TCGA.2Y.A9GV.01A.11R.A38B.07
## ?                    2.2556587
## ?                    1.7531176
## ?                    1.3555924
## ?                    0.4345617
## ?                    3.6989549
r <- as.numeric(apply(z_rna,1,function(i) sum(abs(i) > 1.96)))
figure(width = c(700)) %>%ly_hist(r,breaks = 500) 
length(which(r==0))
## [1] 17

CLINICAL DATA:

# Clinical Data upload
clinical <- t(read.table('Clinical/LIHC.merged_only_clinical_clin_format.txt',header=T, row.names=1, sep='\t'))
# Dimensions of Clinical data:
library(kableExtra)

dim(clinical)
## [1] 377 407
# Sample of first 5 rows and columns of clinical data: 
temp <- as.data.frame(clinical[1:5,1:5])
kable(temp, "html") %>%
  kable_styling(bootstrap_options = c("striped","hover", "condensed"), full_width = F, position = "left") %>%
  scroll_box(width = "900px", height = "300px")
admin.month_of_dcc_upload admin.patient_withdrawal.withdrawn admin.project_code admin.year_of_dcc_upload patient.ablations
V2 1 false tcga 2016 NA
V3 1 false tcga 2016 NA
V4 1 false tcga 2016 NA
V5 1 false tcga 2016 NA
V6 1 false tcga 2016 NA
# Number of Samples in clinical data: 
length(colnames(clinical))
## [1] 407
# Number of Samples in z_rna data: 
length(colnames(z_rna))
## [1] 373
# Number of clinical cases we can use from the available:
clinical <- as.data.frame(clinical)
clinical$IDs <- toupper(clinical$patient.bcr_patient_barcode)
sum(clinical$IDs %in% colnames(z_rna))
## [1] 0
Subsetting Fields we need to perform survival analysis:

Since there are multiple fields in clinical data set and we need only specific fields for analysis, matrix will be reduced by combining those fields. In the clinical data set, we need three main fields, tumor event, death event for survival analysis and follow up event for censoring. For new tumor event, There are 8 fields which will all be combined by taking the highest value of all the fields.

# Number of columns with new tumor event data: 
ind_keep <- grep('days_to_new_tumor_event_after_initial_treatment',colnames(clinical))
length(ind_keep)
## [1] 8
# Column names of all new tumor event fields: 
kable(colnames(clinical)[ind_keep], "html") %>%
  kable_styling(bootstrap_options = c("striped","hover", "condensed"), full_width = F, position = "left") 
x
patient.follow_ups.follow_up-2.new_tumor_events.new_tumor_event-2.days_to_new_tumor_event_after_initial_treatment
patient.follow_ups.follow_up-2.new_tumor_events.new_tumor_event.days_to_new_tumor_event_after_initial_treatment
patient.follow_ups.follow_up-3.new_tumor_events.new_tumor_event.days_to_new_tumor_event_after_initial_treatment
patient.follow_ups.follow_up-4.new_tumor_events.new_tumor_event.days_to_new_tumor_event_after_initial_treatment
patient.follow_ups.follow_up.new_tumor_events.new_tumor_event-2.days_to_new_tumor_event_after_initial_treatment
patient.follow_ups.follow_up.new_tumor_events.new_tumor_event-3.days_to_new_tumor_event_after_initial_treatment
patient.follow_ups.follow_up.new_tumor_events.new_tumor_event-4.days_to_new_tumor_event_after_initial_treatment
patient.follow_ups.follow_up.new_tumor_events.new_tumor_event.days_to_new_tumor_event_after_initial_treatment

The same will be performed for the 5 fields that contain death information.

# Number of columns with days to death field: 
inddeath_keep <- grep('days_to_death',colnames(clinical))
length(inddeath_keep)
## [1] 5
# Column names of all days to death fields: 
kable(colnames(clinical)[inddeath_keep], "html") %>%
  kable_styling(bootstrap_options = c("striped","hover", "condensed"), full_width = F, position = "left")
x
patient.days_to_death
patient.follow_ups.follow_up-2.days_to_death
patient.follow_ups.follow_up-3.days_to_death
patient.follow_ups.follow_up-4.days_to_death
patient.follow_ups.follow_up.days_to_death

Similarly, for the follow up fields too.

# Number of columns with last day of follow up field: 
indfollow_keep <- grep('days_to_last_followup',colnames(clinical))
length(indfollow_keep)
## [1] 5
# Column names of all last day to follow up fields: 
kable(colnames(clinical)[indfollow_keep], "html") %>%
  kable_styling(bootstrap_options = c("striped","hover", "condensed"), full_width = F, position = "left")
x
patient.days_to_last_followup
patient.follow_ups.follow_up-2.days_to_last_followup
patient.follow_ups.follow_up-3.days_to_last_followup
patient.follow_ups.follow_up-4.days_to_last_followup
patient.follow_ups.follow_up.days_to_last_followup

Now that we know what fields are we can combine for the final analysis, a new matrix is generated by the following code:

# For New tumore event: 
new_tum <- as.matrix(clinical[,ind_keep])
new_tum_collapsed <- c()
for (i in 1:dim(new_tum)[1]){
  if ( sum ( is.na(new_tum[i,])) < dim(new_tum)[2]){
    m <- min(new_tum[i,],na.rm=T)
    new_tum_collapsed <- c(new_tum_collapsed,m)
  } else {
    new_tum_collapsed <- c(new_tum_collapsed,'NA')
  }
}

# For death events: 
ind_keep <- grep('days_to_death',colnames(clinical))
death <- as.matrix(clinical[,ind_keep])
death_collapsed <- c()
for (i in 1:dim(death)[1]){
  if ( sum ( is.na(death[i,])) < dim(death)[2]){
    m <- max(death[i,],na.rm=T)
    death_collapsed <- c(death_collapsed,m)
  } else {
    death_collapsed <- c(death_collapsed,'NA')
  }
}

# For follow up events: 
ind_keep <- grep('days_to_last_followup',colnames(clinical))
fl <- as.matrix(clinical[,ind_keep])
fl_collapsed <- c()
for (i in 1:dim(fl)[1]){
  if ( sum (is.na(fl[i,])) < dim(fl)[2]){
    m <- max(fl[i,],na.rm=T)
    fl_collapsed <- c(fl_collapsed,m)
  } else {
    fl_collapsed <- c(fl_collapsed,'NA')
  }
}

# Combining the fields together to create an analysis matrix: 
all_clin <- data.frame(new_tum_collapsed,death_collapsed,fl_collapsed)
colnames(all_clin) <- c('new_tumor_days', 'death_days', 'followUp_days')

#Dimensions of all_clin matrix: 
dim(all_clin)
## [1] 377   3
# Sample of clinical data 
kable(head(all_clin), "html") %>%
  kable_styling(bootstrap_options = c("striped","hover", "condensed"), full_width = F, position = "left")
new_tumor_days death_days followUp_days
NA NA NA
NA 724 NA
NA 1624 NA
NA NA 1939
NA 2532 NA
NA 1271 NA
Censoring for cases we cannot use in analysis:

Since some patients sometimes discontinue their treatment or for some reason, the death or tumor event is not recorded, these data are removed from the analysis since we would like to keep only the cases where a certain death event or survival event is known. Therefore, for each cases, if the follow up event is the longest date available compared to the death and tumor event, then these rows will not be considered.

# vector with time to new tumor containing data to censor for new_tumor
all_clin$new_time <- c()
for (i in 1:length(as.numeric(as.character(all_clin$new_tumor_days)))){
  all_clin$new_time[i] <- ifelse ( is.na(as.numeric(as.character(all_clin$new_tumor_days))[i]),
                    as.numeric(as.character(all_clin$followUp_days))[i],as.numeric(as.character(all_clin$new_tumor_days))[i])
}


# vector with time to death containing values to censor for death
all_clin$new_death <- c()
for (i in 1:length(as.numeric(as.character(all_clin$death_days)))){
  all_clin$new_death[i] <- ifelse ( is.na(as.numeric(as.character(all_clin$death_days))[i]),
                                 as.numeric(as.character(all_clin$followUp_days))[i],as.numeric(as.character(all_clin$death_days))[i])
}

dim(all_clin)
## [1] 377   5
all_clin$death_event <- ifelse(clinical$patient.follow_ups.follow_up.vital_status == 'alive', 0,1)

# Adding row.names to clinical
rownames(all_clin) <- clinical$IDs